[OPT] Tail Loop Optimization #1567

Merged
merged 1 commit into from
Feb 12, 2025


@briannwu briannwu commented Jan 17, 2025

details:

  1. Separate tailLoopOpt for A / B: tailLoopOptA / tailLoopOptB.
  2. Not supported: DTV, SparseGemm.
  3. Reorder load instructions with more vgprs.

Compare:

globalReadMode = 2 -> use more VGPRs so that the global read (GR), s_waitcnt, and v_or_b32 instructions can be reordered

Before:
/* g2l=0, load component 0 */
buffer_load_ubyte_d16 v[vgprG2LA+0+0], ..., 0 offen offset:0 // load one buffer value
/* g2l=0, load component 1 */
buffer_load_ubyte_d16 v0, ..., 0 offen offset:1 // load one buffer value
s_waitcnt vmcnt(0)
v_lshlrev_b32 v0, 0x8, v0 // shift left to higher 8 bits
v_or_b32 v[vgprG2LA+0+0], v[vgprG2LA+0+0], v0 // pack a sub 8-bit with dest
/* g2l=0, load component 0 */
buffer_load_ubyte_d16 v[vgprG2LA+0+4], ... offen offset:0 // load one buffer value
/* g2l=0, load component 1 */
buffer_load_ubyte_d16 v0, ... offen offset:1 // load one buffer value
s_waitcnt vmcnt(0)
v_lshlrev_b32 v0, 0x8, v0 // shift left to higher 8 bits
v_or_b32 v[vgprG2LA+0+4], v[vgprG2LA+0+4], v0 // pack a sub 8-bit with dest
...

After:
buffer_load_ubyte_d16 v[vgprG2LA+0+0], ... offen offset:0 // load one buffer value
buffer_load_ubyte_d16 v0, ..., 0 offen offset:1 // load one buffer value
buffer_load_ubyte_d16 v[vgprG2LA+0+4], ... offen offset:0 // load one buffer value
buffer_load_ubyte_d16 v1, ... offen offset:1 // load one buffer value
buffer_load_ubyte_d16 v[vgprG2LA+1+0], offen offset:0 // load one buffer value
...
s_waitcnt vmcnt(10)
v_lshlrev_b32 v0, 0x8, v0 // shift left to higher 8 bits
v_or_b32 v[vgprG2LA+0+0], v[vgprG2LA+0+0], v0 // pack a sub 8-bit with dest
s_waitcnt vmcnt(8)
v_lshlrev_b32 v1, 0x8, v1 // shift left to higher 8 bits
v_or_b32 v[vgprG2LA+0+4], v[vgprG2LA+0+4], v1 // pack a sub 8-bit with dest
...
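
The wait counts in the reordered listing can be reproduced with a small model. This is only a sketch: it assumes every tile needs two byte loads, that all loads are issued before any packing, and that loads complete in issue order. The helper name `vmcnt_schedule` is hypothetical and is not part of Tensile. With six tiles (twelve loads in flight, which is consistent with the `vmcnt(10)` and `vmcnt(8)` shown, although the listing elides the exact count), the model gives the same first two values.

```python
def vmcnt_schedule(num_tiles, loads_per_tile=2):
    """Return the vmcnt value to wait on before packing each tile.

    Hypothetical model: all loads are issued up front, tile by tile, and
    complete in issue order (an assumption of this sketch).
    """
    total_loads = num_tiles * loads_per_tile
    schedule = []
    for tile in range(num_tiles):
        # Tile `tile` is ready once its own loads and all earlier ones have
        # returned, so this many later loads may still be in flight.
        schedule.append(total_loads - (tile + 1) * loads_per_tile)
    return schedule


if __name__ == "__main__":
    for tile, count in enumerate(vmcnt_schedule(6)):
        print(f"tile {tile}: s_waitcnt vmcnt({count})")
```

In the "Before" listing every tile waits on `vmcnt(0)`, so each pack stalls until its own loads return; in this model only the last tile waits for the queue to drain completely.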

//////////////////////////////////////////////////////////////////////////////////////////////////////////////////
globalReadMode = 3 -> use wider global load instructions
Before:
/* g2l=0, load component 0 */
buffer_load_ubyte_d16 v[vgprG2LB+0+0], ..., 0 offen offset:0 // load one buffer value
/* g2l=0, load component 1 */
buffer_load_ubyte_d16 v51, ..., 0 offen offset:1 // load one buffer value
/* g2l=0, load component 2 */
buffer_load_ubyte_d16_hi v52, ..., 0 offen offset:2 // load one buffer value
/* g2l=0, load component 3 */
buffer_load_ubyte_d16_hi v53, ..., 0 offen offset:3 // load one buffer value
...
s_waitcnt vmcnt(14)
v_lshlrev_b32 v51, 0x8, v51 // shift left to higher 8 bits
v_or_b32 v[vgprG2LB+0+0], v[vgprG2LB+0+0], v51 // pack a sub 8-bit with dest
s_waitcnt vmcnt(13)
v_or_b32 v[vgprG2LB+0+0], v[vgprG2LB+0+0], v52 // pack a sub 8-bit with dest
s_waitcnt vmcnt(12)
v_lshlrev_b32 v53, 0x8, v53 // shift left to higher 8 bits
v_or_b32 v[vgprG2LB+0+0], v[vgprG2LB+0+0], v53 // pack a sub 8-bit with dest
...
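
The packing sequence above can be emulated in plain Python to check that four byte loads plus shifts and ORs produce the same 32-bit value as one little-endian dword read. This is an illustrative model, not generator code: the helper names are hypothetical, and it assumes the unused half of each temporary register is zero.

```python
import struct

MASK32 = 0xFFFFFFFF


def load_ubyte_d16(byte):
    # Models buffer_load_ubyte_d16: the byte lands in the low 16 bits.
    return byte & 0xFF


def load_ubyte_d16_hi(byte):
    # Models buffer_load_ubyte_d16_hi: the byte lands in the high 16 bits.
    # Assumption: the low half of the temporary register is zero.
    return (byte & 0xFF) << 16


def pack_four_bytes(b0, b1, b2, b3):
    dest = load_ubyte_d16(b0)      # component 0 -> bits 7:0
    v51 = load_ubyte_d16(b1)       # component 1
    v52 = load_ubyte_d16_hi(b2)    # component 2 -> bits 23:16
    v53 = load_ubyte_d16_hi(b3)    # component 3
    v51 = (v51 << 8) & MASK32      # v_lshlrev_b32 v51, 0x8, v51
    dest |= v51                    # v_or_b32 dest, dest, v51
    dest |= v52                    # v_or_b32 dest, dest, v52
    v53 = (v53 << 8) & MASK32      # v_lshlrev_b32 v53, 0x8, v53
    dest |= v53                    # v_or_b32 dest, dest, v53
    return dest


data = bytes([0x11, 0x22, 0x33, 0x44])
packed = pack_four_bytes(*data)
wide = struct.unpack("<I", data)[0]  # what a single dword load would return
print(hex(packed), hex(wide))
```

Both values come out as `0x44332211`, which is why a single wider load can stand in for the four byte loads and three `v_or_b32` instructions wherever all four bytes are valid.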

After:
buffer_load_dwordx4 v[vgprG2LB+0:vgprG2LB+0+3], v[vgprGlobalReadOffsetB+0], s[sgprSrdB:sgprSrdB+3], 0 offen offset:0 // G -> Reg 0_0_0_0
... (calculate some data to determine how to load the last data)
label_LoadB:
... (jump to specified load tile)
label_LOAD_B0:
label_LOAD_B0_K1:
s_cmp_ge_u32 s11, 1
s_cbranch_scc0 label_MergeB
/* g2l=0, load component 0 */
buffer_load_ubyte_d16 v54, ... 0 offen offset:0 // load one buffer value
label_LOAD_B0_K2:
...
label_LOAD_B0_K15:
... (load code)
s_branch label_MergeB
label_MergeB:
... (jump to specified load tile)
label_MERGE_B0:
label_MERGE_B0_K1:
s_cmp_ge_u32 s11, 1
s_cbranch_scc0 label_CheckB_OOB
s_waitcnt vmcnt(0)
v_or_b32 v[vgprG2LB+0+0], v[vgprG2LB+0+0], v54 // pack a sub 8-bit with dest
label_MERGE_B0_K2:
...
label_MERGE_B0_K15:
... (pack code)
s_branch label_CheckB_OOB
label_CheckB_OOB:
...
label_CheckLoopBeginB:
... (calculate size to be loaded and size can be loaded)
label_B0:
... (check if there's other tile should be loaded again due to OOB)
s_cbranch_scc1 label_LoadB // Reload
s_branch label_CheckLoopBeginB // Re check
label_TailGlobalLoadEnd:
s_waitcnt vmcnt(0)
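
One plausible reading of the control flow above is: use 16-byte loads wherever a full chunk is valid, then fall through a chain of per-byte loads (K1 to K15) for whatever is left, stopping at the first count that exceeds the remainder. The sketch below models only that idea. The function name, the 1-byte element size, and the zero fill of the unused bytes are assumptions, and the out-of-bounds recheck loop (`label_CheckB_OOB`) is not modelled.

```python
def tail_load(buffer, offset, remaining, width=16):
    """Read `remaining` valid bytes from `buffer` starting at `offset`.

    Full `width`-byte chunks model one wide load each; the leftover bytes are
    read one at a time and merged into a zero-filled chunk.
    """
    out = bytearray()
    full, leftover = divmod(remaining, width)

    for i in range(full):
        start = offset + i * width
        out += buffer[start:start + width]   # one wide (dwordx4-like) load

    if leftover:
        chunk = bytearray(width)             # unused bytes stay zero
        base = offset + full * width
        for k in range(1, width):            # mirrors the K1 .. K15 chain
            if leftover < k:                 # s_cmp_ge_u32 / s_cbranch_scc0
                break
            chunk[k - 1] = buffer[base + k - 1]  # one byte load, then merge
        out += chunk

    return bytes(out)


source = bytes(range(64))
result = tail_load(source, 0, 37)
print(len(result), result[32:37].hex(), result[37:].hex())
```

For 37 valid bytes the model issues two wide loads and five byte loads, and the last chunk is padded with eleven zero bytes.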


hcman2 commented Jan 20, 2025

Any brief before/after comparison of the tail loop asm code?

@briannwu (author) replied with two images (contents not preserved).

@briannwu briannwu force-pushed the tail_opt branch 4 times, most recently from ef4242e to 4b4f883 Compare January 20, 2025 07:58
hcman2 previously approved these changes Jan 20, 2025

@hcman2 hcman2 left a comment


Good optimization. It would be even better if you could share the performance gain for the sensitive sizes.


@aazz44ss aazz44ss left a comment


LGTM

tailLoopOpt2nd == False) else 3

globalReadMode1st = 3 if tensorParameters1st["isSwizzled"] else globalReadMode1st
globalReadMode2nd = 3 if tensorParameters2nd["isSwizzled"] else globalReadMode2nd
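
The two overrides above can be read with the mode meanings from the PR description in mind: mode 2 reorders the byte loads using extra VGPRs, and mode 3 uses wider global loads. The sketch below restates only the swizzle override with those meanings spelled out as comments. The constant and function names are hypothetical, and the initial mode selection (truncated in the quoted diff) is deliberately left out.

```python
# Mode meanings as described in the PR text; names are illustrative only.
GLOBAL_READ_MODE_REORDER = 2    # reorder GR / s_waitcnt / v_or_b32 using more VGPRs
GLOBAL_READ_MODE_WIDE_LOAD = 3  # use wider global load instructions


def apply_swizzle_override(mode, is_swizzled):
    """Force the wide-load mode for a swizzled tensor; otherwise keep `mode`."""
    return GLOBAL_READ_MODE_WIDE_LOAD if is_swizzled else mode


# Stand-in for tensorParameters1st / tensorParameters2nd in the quoted diff.
tensor_parameters = {"isSwizzled": True}
mode = apply_swizzle_override(GLOBAL_READ_MODE_REORDER, tensor_parameters["isSwizzled"])
print(mode)
```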

@aazz44ss aazz44ss Feb 6, 2025


Can you add more comments here explaining what globalReadMode1st and globalReadMode2nd are?

@briannwu briannwu merged commit 4a104fd into ROCm:develop Feb 12, 2025
14 checks passed
mahmoodw pushed a commit to mahmoodw/hipBLASLt that referenced this pull request Mar 4, 2025
commit 7883b9d060923a1c3d3ee5f2a0846d6f5d17e6b2
Author: Yen Ong <[email protected]>
Date:   Mon Mar 3 15:21:16 2025 -0600

    Add HostLibraryTests to TensileLite

commit 15222f77b4ddec94db95d653e9f138f56d1fdd43
Author: Slobodan Josic <[email protected]>
Date:   Tue Mar 4 10:41:47 2025 +0100

    Tuning for gfx1201 (#1715)

    * Tuned HHS TN gfx1201 GridBased problem sizes used in vLLM based Meta-Llama-3.2-3B-Instruct model

    * Tuned HHS TN gfx1201 GridBased problem sizes used in vLLM based Meta-Llama-3.1-8B-Instruct model

    * Tuned BBS TN gfx1201 GridBased problem sizes used in vLLM based Meta-Llama-3.2-3B-Instruct model

    * Tuned BBS TN gfx1201 GridBased problem sizes used in vLLM based Meta-Llama-3.1-8B-Instruct model

    * Tuned HHS TN gfx1201 Equality problem sizes used in Meta-Llama-3.1-8B-Instruct, Meta-Llama-3.2-3B-Instruct and Qwen2.5-7B-Instruct Q4_K_M.gguf models

    * Tuned HSS TN gfx1201 Equality problem sizes used in Meta-Llama-3.1-8B-Instruct, Meta-Llama-3.2-3B-Instruct and Qwen2.5-7B-Instruct Q4_K_M.gguf models

    * Tuned HHS and HSS TN, NN, NT and TT gfx1201 Equality problem sizes used in Stable-Diffusion-3-Medium model

commit 65df63cdbc7a0b6b2da578603341af791414c00d
Author: Jinp800125 <[email protected]>
Date:   Tue Mar 4 15:30:14 2025 +0800

    Update FP32 TN/NN/NT logic (#1708)

    Co-authored-by: victorwu <[email protected]>

commit 6b674cf72d59bb2fccb80f9c882a9bd0036e5f7f
Author: Ethan <[email protected]>
Date:   Mon Mar 3 22:56:50 2025 +0800

    replace the latest git_info.py

commit cc27366d3dae165adf6d2c77c9f72bfef9f200d0
Author: jichang <[email protected]>
Date:   Wed Feb 26 08:16:56 2025 +0000

    Add scaleC/D into hipblaslt-bench

commit 101eb2e366f9fa62da537c3b90546c8f448958c3
Author: hcman2 <[email protected]>
Date:   Mon Mar 3 11:12:54 2025 +0800

    Fix Sparse Metadata vgpr alignment. (#1711)

commit 402603df7b7347a66b0ed5a2f0c6c8c37a685534
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Fri Feb 28 16:30:17 2025 -0700

    Bump rocm-docs-core from 1.15.0 to 1.17.0 in /docs/sphinx (#1689)

    Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.15.0 to 1.17.0.
    - [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
    - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
    - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.15.0...v1.17.0)

    ---
    updated-dependencies:
    - dependency-name: rocm-docs-core
      dependency-type: direct:production
      update-type: version-update:semver-minor
    ...

    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit 52244d3c253d34b3a813e2aa2fca60d4f961c4e3
Author: Daine McNiven <[email protected]>
Date:   Thu Feb 27 18:45:10 2025 -0700

    Change value_type of AddressWS in asm header from output type to compute type. (#1610)

commit 0f4c5e89bf4bc604f3ac26d32d194c73db23ef21
Author: Torre Zuk <[email protected]>
Date:   Thu Feb 27 16:28:13 2025 -0700

    reinterpreted structs to match (#1596)

    * struct exposed to external API via reinterpret cast so keep the same

commit a7f7bd0a3866f394535f5977b246fd4d80c47b68
Author: NaveenElumalaiAMD <[email protected]>
Date:   Thu Feb 27 12:05:04 2025 -0700

    Apply -ftemplate-depth=2048 to tensile_host.cpp if GCC_VERSION is less than 7.6.0 to avoid template recursion issue (#1709)

commit bcd40b23b876f7f89220fc9036e51a75f059eff0
Author: KKyang <[email protected]>
Date:   Wed Feb 26 22:55:27 2025 +0800

    Fix pickle copying singleton (#1705)

commit 199f5a352644ecbe9c3764ef5d78a446cf093684
Author: KKyang <[email protected]>
Date:   Wed Feb 26 14:03:49 2025 +0800

    Add log to roctx (#1617)

commit b01040cd7aa53733c7d150a0998cce3f5c5ca5db
Author: jichang <[email protected]>
Date:   Tue Feb 25 10:06:15 2025 +0000

    Dockerfile: build hipblaslt from latest rocm ub22.04 docker image

commit e4bcd8a6625fe9d953711ee8817cc568bd36062a
Author: KKyang <[email protected]>
Date:   Wed Feb 26 13:55:09 2025 +0800

    Add device support for offline tuning tool (#1692)

commit 3791674c0a5bfecd779902fa0012a22f7a7e98d1
Author: KKyang <[email protected]>
Date:   Wed Feb 26 13:54:47 2025 +0800

    Fix compiler error (#1619)

    The amdclang++ treats allocating 0 length array as error in 6.4.

commit f28e499dd1b5982ad75b0a02d836a3b77b119078
Author: KKyang <[email protected]>
Date:   Wed Feb 26 13:46:45 2025 +0800

    Add BSS NT BGRADB for gfx942 80cu (#1613)

    * Limit gwvw for gradient bias if glvw < gwvw

    * Add bbs gradient bias b grid for gfx942 80cu

commit 5c2e8b7ee5fe975976f30a1b8d37bd1da93a5712
Author: briannwu <[email protected]>
Date:   Tue Feb 25 11:31:55 2025 +0800

    [OPT] Reschedule DTV global read if cases with nrc = 2x (#1678)

    Detail:
    Original scheduling way of globalread instructions is to read along
    perp. dim first and then read in coal. dim.
    This commit is to reorder instructions to read in coal. dim first.
    And insert 2 continuous global read instructions if TN and nrc = 2.

    Note: Disabled if isSwizzled. Enabled if coalesced along K dim.

commit 1780bfccf74e37142a9ce8c31b4cad52ccd9d9fc
Author: jichang <[email protected]>
Date:   Tue Feb 4 03:27:07 2025 +0000

    apply clang-format

commit e3f7a48014ede147163dbeda15512cfcf428e393
Author: Sv. Lockal <[email protected]>
Date:   Mon Jan 6 06:53:06 2025 +0000

    Fix compilation with libc++

    `std::vector<char[16]>` works with libstdc++, but fails with libc++.
    The issue was reported to https://github.com/llvm/llvm-project/issues/98734, and according to discussion, is not a part of c++17 (used currently in hipBLASLt).
    This change switches `char[16]` to `std::array<char, 16>`, which is compliant with c++17.

commit 5fdf27e2f5f25db6c73b1005f08036a7e92e1dc0
Author: Ethan <[email protected]>
Date:   Tue Feb 25 00:39:53 2025 +0800

    Tuning: Update STA TN tuning for HHS and F8NHS

    Updated GridBased STA HHS & F8NHS yamls

commit e59a726d608ce3580e05bd5897b869d221a020cd
Author: boringmorning <[email protected]>
Date:   Sat Feb 22 14:23:52 2025 +0000

    Update TN HHS/F8NHS STA logic yamls

commit c8fb6ed6e7b9a18eb420c36dedf6978a82deae6d
Author: AndySu12 <[email protected]>
Date:   Fri Feb 21 22:02:54 2025 +0800

    Update gfx942 HHS NN/TN Equality yamls for 0210_fp16ranking (#1694)

commit a3683f938843f084bef0a986a8c11ee3b6e87574
Author: Feroz <[email protected]>
Date:   Fri Feb 21 07:18:52 2025 -0600

    aquavanjaram942_80cu HHS NN offline tuning improvement with trig_float init (#1691)

commit dc1b6a7371d9ff90e0fa4529424f380575e26e9c
Author: Jinp800125 <[email protected]>
Date:   Fri Feb 21 18:23:00 2025 +0800

    HSS BSS Grid V3 (#1688)

    Co-authored-by: victorwu <[email protected]>

commit 2fc7f66a0b2a90bfe39619d8ef84d2d8b1b3ddd1
Author: jichang <[email protected]>
Date:   Thu Feb 20 01:30:56 2025 +0000

    correct ISA version of yamls from 940/941 to 942

commit 4240a7fb1edaad6c088cb78b19c1732c054bbb13
Author: George Tseng <[email protected]>
Date:   Fri Feb 21 17:34:26 2025 +0800

    Update TN HHS/F8NHS STA logic yamls. (#1693)

commit 6946d82344abafbe3b12118c4345d1f60cd7e9dc
Author: Henry Ho <[email protected]>
Date:   Fri Jan 17 07:40:39 2025 +0000

    add gfx942 BF16 STA freesize

commit 3ac3eb2438c42cedf2b2985e04f34627ef2dd0c3
Author: Henry Ho <[email protected]>
Date:   Fri Jan 17 07:06:02 2025 +0000

    Support BF16 Swizzle

commit 463ab5c7fa1cf7f0c45169f2239a3df5527c0a63
Author: Henry Ho <[email protected]>
Date:   Fri Feb 14 16:19:01 2025 +0800

    fix gfx942 228cu regression

commit 0b4eeeab9e1cc81e047e45a2f2ba533404152bf5
Author: Henry Ho <[email protected]>
Date:   Tue Feb 11 15:51:11 2025 +0800

    update equality to fix gfx942 regression

commit ff4e80f853ff2fb42bbf9587f72380cc7c87471e
Author: xiaohuguo2023 <[email protected]>
Date:   Thu Feb 20 09:37:50 2025 +0000

    update ScheduleIterAlg reject conditions (#1641)

commit 65b8f5af80f591ab9e4f9e585f450f93d0fc3d19
Author: jichang <[email protected]>
Date:   Wed Feb 19 08:01:31 2025 +0000

    hipblaslt-bench: Fix output error of atol,rtol if data is inf

commit 9f3ea970c32d9d0d6895d005bd5963d3b8014ca7
Author: hcman2 <[email protected]>
Date:   Thu Feb 20 14:39:07 2025 +0800

    Release and allocate VGPR resoures in tail loop. (#1586)

    1. Add a VGPR base index definition.
    2. Rearange VGPR index order for further optimization.
    3. Re-allocate VGPR for tail loop.

    Fix 6 potential bugs:
    1. DTV will use the Valu VGPRs which is released in the beginning of tail loop.
    2. BiasSum Valu VGPR should be used when endSum but is released in the beginning of tail loop.
    3. _replaceActBranchLabel() always replaces label without postfix.
       However we should check the label we'd really like to replace with.
    4. For DTL, numVgprG2LAllocated is not set so that it will be
       default=-1.
    5. Fix G2LA vgpr allocation bug for navi3x.
    6. Fix LocalWrite vgpr index bug.

    This change is the first step to optimize the VGPR usage in unroll loop.
    In general, the VGPRs usage in the unrolled loop is dependent from the
    tail. In tail, the VGPR can be used more effectively.

commit bafec92acdb54e1064df22e96c6cb83162f87b9c
Author: wencchen <[email protected]>
Date:   Fri Feb 7 08:34:03 2025 +0000

    fix hardware monitor and clk-sys=-nan

commit fd76d3f4aa4b1e23b9eb590627fc0b54eec8aabc
Author: Ethan <[email protected]>
Date:   Thu Feb 20 09:16:15 2025 +0800

    perf-script: update new data type name (f8n)

    Updated datatype and added info log

commit 0a39d75effba7661e81d5d39ce04de820e0448e3
Author: jichang <[email protected]>
Date:   Tue Feb 18 07:57:59 2025 +0000

    Add Gtests for swizzleA HHS and F8NHS for gfx942

commit e2e68a329024146facb8db4a6f691dcf644298c1
Author: jichang <[email protected]>
Date:   Tue Feb 18 07:54:03 2025 +0000

    Fix: swizzleA problem get non-swizzleA solution from cached library

commit cb0c54121f99848736814c52b72894207f902ac8
Author: who who who <[email protected]>
Date:   Mon Feb 17 13:37:13 2025 +0800

    Env enhancement (#1670)

    * enhance logs

    * add max mt environment variable

commit 9c4c6a06ebcbf32b62517e25091b8209841ff1df
Author: KKyang <[email protected]>
Date:   Mon Feb 17 11:40:02 2025 +0800

    Add BBS/HHS/SSS TN/NN/NT Equality got gfx942_64cu (#1667)

commit 62a1e1112c4a9b513ecf438e483c34f8b9f86fa9
Author: Alex Brown <[email protected]>
Date:   Sun Feb 16 09:51:08 2025 -0700

    F8 codegen support for stream-k (#1580)

commit 8a15ea767c5c8fb5e990312bc7c662b14b48ec18
Author: Vin Huang <[email protected]>
Date:   Tue Feb 11 08:19:05 2025 +0000

    Limit the global read length to depthUM when trying to wider the global read of metatdata

commit 8652497c44c165d49b4421539bf636266f35dc48
Author: Vin Huang <[email protected]>
Date:   Fri Nov 22 12:00:46 2024 +0000

    Fix the packBIdx when HasEccHalf is ture but HasWMMA_V1 is false and scheduleIterAlg = 3

commit 3db8677164baeebb1bc368f731b49542637fd45b
Author: Vin Huang <[email protected]>
Date:   Thu Nov 14 12:18:03 2024 +0000

    correct pack latency for metadata when MIInputPerThreadMetadata is 4

commit 9c28f8e0b52ea7214b4d76f9c449c1ed88d96110
Author: Vin Huang <[email protected]>
Date:   Fri Nov 15 10:14:09 2024 +0800

    [Sparse] Metadata only use one vgpr in current SMFMA instructions, so set isHigh16Bits and isHigh8Bits to False.

commit a99c10047c8be879985dc719ea80e2b4ba1f9504
Author: Vin Huang <[email protected]>
Date:   Fri Nov 15 10:12:39 2024 +0800

    [Sparse] fix the dest vgpr when packing 4 metadata to one vgpr

commit 256da07357e791028255bb3684ba4fc513c9fd38
Author: Vin Huang <[email protected]>
Date:   Tue Nov 5 00:08:59 2024 +0800

    fix using wrong vgprG2L when writing metadata to local memory

commit 778b4a918835db03127b8229249cb330133a12ca
Author: b-shi <[email protected]>
Date:   Sat Feb 15 08:29:08 2025 -0600

    Auto convert ocp f8 to nanoo f8 in hipblaslt-bench on gfx94X (#1654)

    * Convert ocp f8 to nanoo in hipblaslt-bench

commit 7ec4fb940ff9498ed941ffaa1d84b2bb448f6961
Author: Feroz <[email protected]>
Date:   Fri Feb 14 11:33:21 2025 -0600

    aquvanjaram942X Tune BBS TN GEMMs (#1633)

commit 07110b7159333aada4b51c66fe3757a54c5bf7fe
Author: mengzcai <[email protected]>
Date:   Tue Feb 11 09:03:08 2025 +0000

    client flow support fp8 swizzle

    add logic yamls, sample

    add gtest for fp16,fp8 swizzle

    assert1

commit e3fc8e648d95acb4a67e29ba602c7d700cd103eb
Author: Vin Huang <[email protected]>
Date:   Wed Feb 12 01:48:20 2025 +0000

    Store M-Dim and N-Dim's bias and alphaVec information separately.

commit 18e54d769f316018dcd4d116bb47f15d9b47adf3
Author: AndySu12 <[email protected]>
Date:   Fri Feb 14 16:27:52 2025 +0800

    Update gfx942 BBS/F8NBS TN Equality yamls for 70b prefill (#1669)

commit 2ff718770f5d89c4d2ce46b8a660eb86d38c2712
Author: Peter Cheng <[email protected]>
Date:   Thu Feb 13 02:42:20 2025 +0000

    tensilelite: add silu to bias stride test

commit 854109121f2aa6d1e1889f25124b9752bf3ca3c7
Author: Peter Cheng <[email protected]>
Date:   Wed Feb 12 04:33:13 2025 +0000

    change swish epilogue enum value and add EXT suffix

commit ae5fabe608c539df4147e68e6acace89de83b753
Author: Peter Cheng <[email protected]>
Date:   Wed Feb 12 04:16:48 2025 +0000

    remove swish from hipblaslt_all

commit a051961cb2071ddebf5829b494f8eb1a435fe7f2
Author: Peter Cheng <[email protected]>
Date:   Wed Feb 12 02:27:03 2025 +0000

    free resources before return in swish sample

commit 82636c5d4f96228ca91f9b0258895c0961a02cc4
Author: Peter Cheng <[email protected]>
Date:   Wed Feb 12 02:15:51 2025 +0000

    use norm_check instead of unit_check for swish test

commit 1a64f9a493b45229a19e1dde23c913577118dbdf
Author: Peter Cheng <[email protected]>
Date:   Tue Feb 11 08:26:40 2025 +0000

    add swish sample

commit a7040ae87d5c458b80d10e8a170e5c99172a3c04
Author: Peter Cheng <[email protected]>
Date:   Tue Feb 11 07:09:38 2025 +0000

    hipblaslt: enable swish with beta=1 (SiLU)

commit 9f978b6672ae19459cb7121f7268956d13d33520
Author: Peter Cheng <[email protected]>
Date:   Tue Feb 11 06:50:56 2025 +0000

    tensilelite: add swish activation

commit ec5720788f267663fb4f5db61438850fd11800b4
Author: Braden Stefanuk <[email protected]>
Date:   Thu Feb 13 09:14:52 2025 -0700

    Use 4-tuple-triple for clang-offload-bundler (#1658)

commit e4dc7cb08b39c36d89b4d86fb43a71dfc4e0305d
Author: Jeffrey Novotny <[email protected]>
Date:   Thu Feb 13 11:13:07 2025 -0500

    Add hipBLASLt Stream-K documentation (#1642)

    * Add hipBLASLt Stream-K documentation

    * Incorporate suggestions from internal review

    * Adjust opening paragraph

commit f45e5b9b6550467bd79516779aea99a9adc6cae4
Author: Braden Stefanuk <[email protected]>
Date:   Wed Feb 12 17:50:08 2025 -0700

    Move Common.py to module (#1607)

    * feat: no global working path with asserts

    * fix: bad import statement

    * fix: different build_tmp dir in cmake than tensile

    * fix: incorrect pathing in Tensile benchmarking

    * fix: string wrap paths for assert checks

    * fix: ensure all paths are built

    * fix: missing global parameter warning

    * fix: ensure library logic path is built

    * fix: client writer paths

    * fix: use child path for client libraries

    * remove PrintTiming from global params

    * remove another assert

    * restore PrintTiming

    * fix: reviewer comments

    * refactor: use env var for @timing decorator

    * refactor: remove library print debug

    * refactor: remove exit after kernel gen option

    * style: remove commented code

    * refactor: move common to a module

    * feat: remove 'CustomKernelDirectory' global param

    * fix: amd clang version global param

    * fix: second pass on amd clang version

    * fix: imports for bolted on ops scripts

    * chore: remove safety asserts

    * chore: create more separation in Common

    * fix: missing line at eof

    * fix: basic reviewer comments

    * chore: move global dependenct funcs to GlobalParameters.py

    * fix: pass global params to splitArchs

    * style: don't format caps funcs

    * fix: bad import

    * Add TensileLogic program to verify matrix instructions (#3)

    * Remove unused and static global parameters (#5)

    * fix: purge library-print-debug
    * feat: make build paths static
    * feat: remove 'SortProblems' global param
    * feat: remove 'ExpandRanges' global param
    * feat: remove 'WavefrontWidth' global param
    * feat: remove 'ValidateLibrary' global param
    * feat: remove 'EnableHalf' and 'ClientArgs' global params
    * feat: prefer profile decorator over 'Profiler' global param
    * chore: remove 'LibraryPrintDebug' from build_client.yaml
    * feat: remove 'MaxFileName' from global params

    * Update preferred search paths (#4)

    * fix: remove files updated by precommit in merge

    * fix: remove log file

    * Copyright update

    ---------

    Co-authored-by: David Dixon <[email protected]>

commit 7012cbe4ecb66dd2f085ef6abc5a5249020cf81d
Author: Mark Kim <[email protected]>
Date:   Wed Feb 12 09:59:36 2025 -0600

    Gfx942 80cu NN grid update (#1656)

    * gfx942 80cu HHS NN grid update

    8192xNx8192
    10240xNx8192
    29568xNx8912
    8192xNx29568
    where N=1-47

    * gfx942 80cu HHS NN grid update

    8192xNx8192
    10240xNx8192
    29568xNx8912
    8192xNx29568

    Where 1<N<~128 (depending)

    * gfx942 80cu HHS NN grid update

    8192xNx8192
    10240xNx8192
    29568xNx8912
    8192xNx29568

    Extend N -> 8192

commit dee248b42a475a7c6d47124140d3f1936c3c41d2
Author: aliry95amd <[email protected]>
Date:   Wed Feb 12 08:58:56 2025 -0700

    Tune grid-based GFX942_80CU_F8NBS_TN (#1645)

commit a0d4cacc641f544c06c4a7db6bfc147b42e6c544
Author: jichang <[email protected]>
Date:   Tue Feb 11 08:51:34 2025 +0000

    Fix CI errors: don't run layernorm API on un-suppoprted platforms

commit 5c5df386ab03784b1b39bb86ba653dcf03dad209
Author: KKyang <[email protected]>
Date:   Wed Feb 12 19:21:10 2025 +0800

    Refactor DepthU == -1 (#1660)

    1. Remove deepcopy
    2. DIfferent default depthU according to compute data size

commit 4a104fd91ba4ca243f7af26b4105ad1313280850
Author: briannwu <[email protected]>
Date:   Wed Feb 12 16:53:42 2025 +0800

    [OPT] Tail Loop Optimization (#1567)

    details:
    1. Separate tailLoopOpt for A / B: tailLoopOptA / tailLoopOptB.
    2. Not supported: DTV, SparseGemm.
    3. Reorder load instructions with more vgprs.

commit db8e93b4fd6c785d4ce6cac11cc3b0ad613ece8b
Author: Henry Ho <[email protected]>
Date:   Tue Feb 11 23:45:12 2025 +0800

    gfx942 MAF tuning

commit 0cb1d8dd70935e904bc16cc671104b1227982342
Author: jichang <[email protected]>
Date:   Tue Feb 11 09:52:58 2025 +0000

    Remove gfx940/gfx941 from tensilelite

commit e4bb811764bda173963ea332a0a33a31809fa92e
Author: jichang <[email protected]>
Date:   Tue Feb 11 09:40:11 2025 +0000

    Remove 941 custom kernels

commit 833118f1b26dce57553114fc11233b9a41bc2e05
Author: jichang <[email protected]>
Date:   Tue Feb 11 09:21:17 2025 +0000

    Remove gfx940 gfx941 solutions

commit 799ca6ea12938689b3be3afe1c1ec672ed51ae92
Author: jichang <[email protected]>
Date:   Tue Feb 11 09:18:30 2025 +0000

    Remove gfx940,gfx941 from hipblaslt lib

commit 99922e3fac11993947d59d9750bdcbe17996d2e0
Author: Minsu Kim <[email protected]>
Date:   Tue Feb 11 19:23:26 2025 -0500

    gfx942 80cu equality tuning for HHS NN/TN (#1634)

    * gfx942 80cu equality tuning for HHS NN/TN

    * gfx942 80cu equality tuning for HHS NN/TN

commit c46e70074a2a639134450752285702de9b0a3160
Author: KKyang <[email protected]>
Date:   Tue Feb 11 21:43:36 2025 +0800

    Add gfx942 64cu TF32 NN Equality (#1643)

commit 134872438ab51423c70012c4f9d7876e81af180d
Author: Jinp800125 <[email protected]>
Date:   Tue Feb 11 17:13:32 2025 +0800

    restore gfx942 regression (#1638)

    Co-authored-by: victorwu <[email protected]>

commit 918cabd8fba46ef504023667c9ab373d289aa5cf
Author: b-shi <[email protected]>
Date:   Tue Feb 11 01:51:11 2025 -0600

    Fix incorrect local index values in MatchTable.yaml (#1640)

commit ac4278d8c90691422d5afdc98dfa32a4fd6e38bc
Author: jichang <[email protected]>
Date:   Mon Feb 10 04:26:05 2025 +0000

    	Fix CI errors: Remove F8 Compability checking from hipblaslt runtime

commit ede613fef29d521287357851712ec182fd96a1c3
Author: jichang <[email protected]>
Date:   Fri Feb 7 08:32:26 2025 +0000

    Don't treat "no solution" as error in client tools

commit 16a73cac0f11a75a5628c9fb2855fd139c8d3412
Author: KKyang <[email protected]>
Date:   Mon Feb 10 14:48:15 2025 +0800

    Add DepthU == -1 (#1627)

    * Add DepthU == -1

    * Update SolutionStructs.py

commit 3ff258f0bb3f2d8751e0380705741384eac2f2dd
Author: AndySu12 <[email protected]>
Date:   Mon Feb 10 11:55:52 2025 +0800

    gfx942 BBS Equality/GridBased update (#1635)

    * Update gfx942 BBS NN/NT/TN Equality yamls for moe5 training

    * Update gfx942 BBS NN/NT/TN GridBased yamls for grid v3 fix

    * Update gfx942 BBS TN GridBased yaml for genai 0122

commit 0200ac211b4f080ae41be771d046bbec5b902b13
Author: Serge Lu <[email protected]>
Date:   Fri Feb 7 10:46:57 2025 +0000

    Added missing swizzle argument

commit b3904583933d4c76147471f542ce16308d6937ed
Author: Ethan <[email protected]>
Date:   Fri Feb 7 14:25:13 2025 +0800

    feature: DTVB with Swizzling (tensorB)

    * implemented swz-b

    fixed wave_id distribution

    * added pytests

    * fix swizzleB padding

    * add new datatype for swizzledB

commit 294161ce63e4fd8a643056860b1736d26f3537ae
Author: Serge Lu <[email protected]>
Date:   Thu Feb 6 17:18:25 2025 +0800

    Swizzle support for hipBLASLt (#1584)

    * Initial draft for hipBLASLt tensor swizzling

    * Refined tensor-swizzling example

    * Fixed incorrect check of opB and swizzleB

    * Draft version of tensor swizzling A for hipblaslt-bench

    * Added free size logic yamls for swizzle-a for demostration

    * hipblaslt-bench: copy init buffer back to CPU for swizzle case

    * update siwzzleA HHS/HSS

    * clean BiasDataTypeList from swizzle pure gemm yaml

    * update AF0EM,AF1EM and ASEM into swizzle logic yaml

    * Added C++ getHeuristic sample for swizzle-A

    * update HSS STA freesize

    * Amended swizzle example with padding

    * Added weight swizzle and pad example

    * Fixed auto-padding in hipblaslt-bench

    * Updated shared headers in clients folder

    * Renamed enum for swizzled tensor

    * Make swizzleB in hipBLASLt API properly propagate

    * Amended document of swizzle for hipBLASLT API

    * Fixed typo in hipblaslt.h

    * Added swizzle enums and updated API reference

    * Added hipblasLtOrder_t to datatypes doc

    * Updated logics for swizzle-A kernels

    * Update docs/api-reference.rst

    ---------

    Co-authored-by: jichang <[email protected]>
    Co-authored-by: Jeffrey Novotny <[email protected]>

commit f7c39ab7ed77cd23e78a24808e33eb3c1d779cda
Author: jichang <[email protected]>
Date:   Wed Feb 5 07:26:35 2025 +0000

    gtest: add alpha = 2 for dgemm test

commit 3cfe2ae2ed2da284b27bb95508829d9eca9af02b
Author: jichang <[email protected]>
Date:   Wed Feb 5 07:25:07 2025 +0000

    Fix: dgemm result is incorrect if alpha is not 1

commit db9778cd9c5a0786ec3fc9ff881abf92b407dec1
Author: Wayne Huang <[email protected]>
Date:   Thu Jan 16 07:11:31 2025 +0000

    Move max frequency retrieval to the begining and add manual input when error.

commit 24e58eeda0ff8003e1bd3f7d7530209eaa0797bf
Author: b-shi <[email protected]>
Date:   Wed Feb 5 22:34:44 2025 -0600

    Fix f8 related ci errors (#1625)

commit d2fbb96f1d4b5c95ddb9663b083de98c8754285e
Author: who who who <[email protected]>
Date:   Thu Feb 6 11:42:53 2025 +0800

    Optimize generator (#1540)

    * use process pool accelerate process

    * beautify code

    * limit process number

    * update readme

    * fix bugs

    * add missing parentheses

    * add missing almalinux in install script

    * add hint into the script

    ---------

    Co-authored-by: root <[email protected]>

commit c7dd1806ac5a99db97fe21cca0f17f5e5eb5a400
Author: jichang <[email protected]>
Date:   Tue Feb 4 09:58:23 2025 +0000

    Fix: incorrect ldd

commit 104249ea2729eb980d8e9442d5301df579b28dae
Author: Hao-Sheng Chen <[email protected]>
Date:   Fri Jan 24 06:23:06 2025 +0000

    Optimize type checking

commit fff2bc21553546f409275ab2f94cfb17d1948134
Author: Peter Cheng <[email protected]>
Date:   Fri Jan 24 14:19:16 2025 +0800

    reject SingleBuffer + GSU = 1

commit a19c66ba1f8a0ce2bb22b1e62dc7cf9642e0b4bc
Author: Torre Zuk <[email protected]>
Date:   Mon Feb 3 11:35:26 2025 -0700

    clear normal operation hip error code (#1605)

    * backward and forward hip compatible

commit 1f54baff121f23e1fadf4d049e0a12c68a58fcef
Author: Torre Zuk <[email protected]>
Date:   Mon Feb 3 00:44:58 2025 -0700

    Fix compiler warnings (#1601)

commit 6fb17a6990009d5320e7de5c7adc14be8aaa341f
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Thu Jan 30 10:17:59 2025 +0000

    Bump rocm-docs-core from 1.13.0 to 1.15.0 in /docs/sphinx

    Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.13.0 to 1.15.0.
    - [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
    - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
    - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.13.0...v1.15.0)

    ---
    updated-dependencies:
    - dependency-name: rocm-docs-core
      dependency-type: direct:production
      update-type: version-update:semver-minor
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

commit 36a97b206f548effbfbf0e7ea23652779b3ea5b5
Author: David Dixon <[email protected]>
Date:   Fri Jan 31 20:37:24 2025 -0700

    Additional TCL diagnostics (#1608)

commit 21ec8f9ffedae673a25edb5308aaf983dd5b08f7
Author: NaveenElumalaiAMD <[email protected]>
Date:   Fri Jan 31 10:07:47 2025 -0700

    Add extended profile logging along with flush and rotating size (#1389)

    * Add extended profile logging and add flush, rotating size to logging

    * clang-format

    * Add flush and rotating data to logging

    * Minor changes

    * Address Torre's comments

    * Doc update with name changes

    * Add cold and hot iters count to bench/profile/extended_profile and some cleanup

commit c107a62d5920bd6b41e42bd4e84ec91e9fc74e9c
Author: Braden Stefanuk <[email protected]>
Date:   Fri Jan 31 09:17:23 2025 -0700

    Remove global working path (#1546)

    * feat: no global working path with asserts
    * fix: different build_tmp dir in cmake than tensile
    * fix: incorrect pathing in Tensile benchmarking
    * fix: ensure all paths are built
    * fix: missing global parameter warning
    * fix: ensure library logic path is built
    * fix: client writer paths
    * fix: use child path for client libraries

commit d9063a8a924d46170c7bd9cad460a02e6dc92ced
Author: b-shi <[email protected]>
Date:   Thu Jan 30 16:16:24 2025 -0600

    Make global soln indices deterministic (#1599)

    * Make global soln index values deterministic

commit 675d0e335a58f6ee8cb48649e19c28076b26ea7e
Author: David Dixon <[email protected]>
Date:   Thu Jan 30 09:08:21 2025 -0700

    Remove Literal for python36 (#1602)

commit f414c101cd294d81f4a272478ba70ac390164532
Author: Torre Zuk <[email protected]>
Date:   Wed Jan 29 09:27:24 2025 -0700

    hipblas-common INTERFACE (#1600)

    * hipblas-common INTERFACE

    ---------

    Co-authored-by: Zuk <[email protected]>

commit 87e165a823399758d944191e94208e1d4cfd4314
Author: David Dixon <[email protected]>
Date:   Wed Jan 29 08:44:38 2025 -0700

    Restore build_tmp removal (#1595)

    * Restore build_tmp removal

commit 832f8e292201380a6b8c72edb1ee92160fac3cae
Author: David Dixon <[email protected]>
Date:   Wed Jan 29 08:43:49 2025 -0700

    Hotfix lazy library config (#1598)

    * Fix no-lazy-library-loading option
    * Update host code to account for library naming

commit cffa30e536e8076acd6e2112928da29f74f1abc8
Author: b-shi <[email protected]>
Date:   Tue Jan 28 19:35:03 2025 -0600

    Incorporate new F8 design (#1577)

    Incorporate new F8 design
     - Added new f8 headers (with backward compatibility)
     - Create distinct f8 data types for NANOO and OCP in Tensilelite
     - Keep default naming for F8 OCP and add extension in default name for NANOO
    ---------
    Co-authored-by: Majed Sujon [email protected]
    Co-authored-by: brian [email protected]
    Co-authored-by: David Dixon [email protected]

commit 976bb4880ff60e37ce40fce0b7fbb767a69a58b1
Author: Pieter Ghysels <[email protected]>
Date:   Tue Jan 28 07:49:41 2025 -0800

    Fix device initialization 2^32 element limitation (#1552)

    * Fix device initialization 2^32 element limitation

    * Do not assume padding for the last block in the batch

commit 13ccd58c5804ee5f2928639d9fdb09f3627738c4
Author: Pieter Ghysels <[email protected]>
Date:   Tue Jan 28 07:49:07 2025 -0800

    Modify trig initialization on device to remove dependency on lda. (#1543)

commit 403cb3900542d2a0462c6a8aa24f9ee2ba38b43c
Author: KKyang <[email protected]>
Date:   Mon Jan 27 22:50:13 2025 +0800

    Remove redundant code for gwvw > 1 route (#1573)

commit 258a2162fbf63bab3f846e3227e6155eea38a5a7
Author: AndySu12 <[email protected]>
Date:   Fri Jan 24 23:43:41 2025 +0800

    Update gfx942 BBS NN/NT/TN Equality yamls for training sizes (#1594)

commit 0e1efffe53661da051a5167d687e84b4720e2e82
Author: root <[email protected]>
Date:   Wed Jan 22 04:44:13 2025 +0000

    Update GridBased v3 F8 TN

commit 1a84965d8c3ec66c8f609fc56f1c7a56785e74b6
Author: victorwu <[email protected]>
Date:   Thu Jan 23 15:01:37 2025 +0000

    Update GridBased v3 BBS NN/NT/TN

commit 381443e630bb09c406ffddc8dca19a65340f3721
Author: victorwu <[email protected]>
Date:   Thu Jan 23 15:00:34 2025 +0000

    Update GridBased v3 HHS NN/NT/TN

commit 7b9e2dba6c7919a0c3d73d021fda6d74458878b5
Author: mengzcai <[email protected]>
Date:   Tue Jan 21 05:39:48 2025 +0000

    Support other types for Swizzling

commit a11ccf64efcd818106dbe37768f69dfcc0a7ff22
Author: smalekta <[email protected]>
Date:   Wed Jan 22 15:36:09 2025 -0500

    added eq tuning for gfx942 F8HS TN (#1579)

commit d6d32e2c10235bc0c35eefd1623daf68c41071d6
Author: Parth Kumar <[email protected]>
Date:   Wed Jan 22 10:39:48 2025 -0600

    Added kernels for HHS & BBS (#1576)

commit d3e76ac89ac8165169f1b2945c63120cd11a4fda
Author: Alex Brown <[email protected]>
Date:   Wed Jan 22 09:08:52 2025 -0700

    Stream-k libs for CPX mode (#1568)

commit 167eb6b08c359091688a7d1e4db012d582746598
Author: jichang <[email protected]>
Date:   Mon Jan 20 06:39:35 2025 +0000

    hipblaslt-bench: throw error if c_type is not equal to d_type

commit bec6dca8fd54144a09b69d3c92c79ab4f6dff46f
Author: Serge Lu <[email protected]>
Date:   Mon Jan 20 13:58:10 2025 +0800

    [TensileLite] Support arbitrary M & K for swizzle-A kernels (#1558)

    * Implemented auto-padding mechanism for swizzle-A

    * Support arbitrary M and K for swizzle-A

    * Fixed tail loop for swizzle-A for arbitrary M & K

    * Added edge test cases for swizzle-A

    * Restore part of tail loop optimization for swizzle-A

    * Re-enabled optimization of tail loop for swizzle-A

    * Fixed segfault when using GuardPageBack for swizzle-A

    * Removed size hacks for swizzle-{A, B}

    * Removed guard-K of A for tail loop of swizzle-A kernel

commit 48b21d267a2797b01d37e8769a808776bc50b665
Author: Josh Chang <[email protected]>
Date:   Mon Jan 20 09:39:53 2025 +0800

    Optimize preloop by v_lshl_add (#1564)

commit 01839feac5d36dbfe4970c82250ecb6d2af4f6ed
Author: David Dixon <[email protected]>
Date:   Fri Jan 17 11:57:51 2025 -0700

    Factor out argument parsing in TensileCreateLibrary (#1514)

    * Make TensileCreateLibrary a directory and add TCL specific ParseArguments module
    * Add exported functions to module
    * Add reasonable defaults
    * Update cmake support
    * Update docs string

commit 38efb62a7bad945dc41535e4ce9fe0f5e3530996
Author: Ethan <[email protected]>
Date:   Fri Jan 17 12:16:59 2025 +0800

    code-gen: Allow WaveGroups to be distributed along n-dim for DTVA/SwizzledA

    * Allow WaveGroup in N-dim for swizzledA

    * directly modify totalElementsCoal/PerpA

    * restore assertion and rejection

    * Added DTVB & updated pytests

    * Fix TLU=True case

commit 0357bb47e653a1d37bfdfd1101a391fb12d04243
Author: smalekta <[email protected]>
Date:   Thu Jan 16 17:02:20 2025 -0500

    added equality tuning for F8HS_TN (#1554)

commit 1be8e65306ce9834e7c361bd1ab140e9da9fe122
Author: Braden Stefanuk <[email protected]>
Date:   Thu Jan 16 09:40:38 2025 -0700

    Fix default code object version (#1553)

commit 15c618e244b58a2a5f16c5cb63add600ed61c805
Author: Henry Ho <[email protected]>
Date:   Thu Jan 16 22:49:21 2025 +0900

    gfx942 BBS F8B8BS F8BS equality tuning (#1551)

    * fix BBS/HHS NN equality yaml

    * equality tuning BBS F8B8BS F8BS

commit cf68a1b63638a2b3d9ebe748b56bdb0e37e056ea
Author: Josh Chang <[email protected]>
Date:   Thu Jan 16 16:35:50 2025 +0800

    fix type in install.sh (#1560)

    Co-authored-by: root <[email protected]>

commit eb00cc86c1101eeaa07d03e2e120c2fe507183e3
Author: Josh Chang <[email protected]>
Date:   Thu Jan 16 16:20:20 2025 +0800

    Install msgpack dependency for CentOS8 (#1559)

    Co-authored-by: root <[email protected]>

commit e5426e879d529c6e76b7eba77e96b6275a17139a
Author: Josh Chang <[email protected]>
Date:   Thu Jan 16 11:44:54 2025 +0800

    Check destination folder with yaml attribute while merging (#1555)

    1. Only check Equality and GridBased folders
    2. Can't prevent manually changed yaml attribute

commit 0ac3cb18b1f7bd98eb0569f777778fa7bb697421
Author: KKyang <[email protected]>
Date:   Thu Jan 16 09:42:23 2025 +0800

    Use B64 instead of B32 (#1548)

commit 305a144636584e08b39e114c5af6c513fa4b69fa
Author: hcman2 <[email protected]>
Date:   Wed Jan 15 18:38:03 2025 +0800

    Update BBS NN/NT/TN Equality yamls. (#1549)

commit bb0cce368f5f86d40169d40cb7fe2090ae8dd0e7
Author: jichang <[email protected]>
Date:   Mon Jan 13 06:04:22 2025 +0000

    hipblaslt-bench: only print device caps of target device

commit ce603f2fb3286953388f4a7d71f4fcb67ec20d1f
Author: KKyang <[email protected]>
Date:   Tue Jan 14 13:06:04 2025 +0800

    Fix if returned index exceeds 32-bit in TensileLite client (#1538)

commit 579164a8f2bd6f9ebc7161a7e8bb7c9c598cc593
Author: Henry Ho <[email protected]>
Date:   Mon Jan 13 08:30:00 2025 +0000

    xf32 TN/NT/equality bmm gridbased update

commit d8edc091eddcd5ec6af80ef54f9e0abc01e0cf3b
Author: Josh Chang <[email protected]>
Date:   Tue Jan 14 10:46:56 2025 +0800

    Equality Tuning: F16 F32 NN TN TT (#1542)

commit e01241ab96209555bdfab3e70e00e039dd81a99e
Author: Jinp800125 <[email protected]>
Date:   Tue Jan 14 10:34:51 2025 +0800

    SyncUp GridBased HHS from BBS (#1504)

    Co-authored-by: victorwu <[email protected]>

commit 67f4da618ec51de724df283c0af0caa0cc567f06
Author: wencchen <[email protected]>
Date:   Wed Jan 8 22:17:09 2025 -0500

    fix frequency monitor

commit 7e0e4be3045dd075af0d4f4893cca49267658f9a
Author: George Tseng <[email protected]>
Date:   Mon Jan 13 11:52:05 2025 +0800

    Add emulation smoke/regression/extended tests. (#1533)

    * Update smoke/regression tests.

    * Update smoke_gtest.yaml

    * Add extended tests.

commit 7d25c31fd2b386f38b4d99910b456561c11afc1a
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Fri Jan 10 17:09:07 2025 -0700

    Bump rocm-docs-core from 1.12.1 to 1.13.0 in /docs/sphinx (#1522)

    Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.12.1 to 1.13.0.
    - [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
    - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
    - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.12.1...v1.13.0)

    ---
    updated-dependencies:
    - dependency-name: rocm-docs-core
      dependency-type: direct:production
      update-type: version-update:semver-minor
    ...

    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit 94dc2948da4e9602918a15e899401afb9594d2fe
Author: David Dixon <[email protected]>
Date:   Fri Jan 10 16:12:56 2025 -0700

    Feature/refactor compile and link fxns (#1507)

    * add toolchain files
    * Remove build commands
    * Improvements to build algorithm
    * Don't build duplicates
    * Copyright

commit c77a900089a891e82739e95e7b66f85134dc237e
Author: wenchuanchen <[email protected]>
Date:   Fri Jan 10 22:34:59 2025 +0800

    fix f32 mac ldspad=-1 (#1530)

commit 1a2b9f0bdf34be40432bda244c5f06b0762ec7d5
Author: Henry Ho <[email protected]>
Date:   Wed Jan 8 11:46:52 2025 +0800

    fix incorrect gridbased logic

commit 7b9972cdc97ea553a6e159fe267faf8995efc62a
Author: hcman2 <[email protected]>
Date:   Fri Jan 10 14:53:11 2025 +0800

    Update NN NT TN Equality yamls. (#1532)

commit 5ff31e7e72dfc5f8ccb7aeb7bc2ad4f652309049
Author: AndySu12 <[email protected]>
Date:   Fri Jan 10 14:52:01 2025 +0800

    Update gfx942 F8B8BS/F8BS TN Equality yamls for models (#1534)

commit 2b26942ad8ffbf37969efcfa148cdbe3698aebb2
Author: KKyang <[email protected]>
Date:   Fri Jan 10 11:22:54 2025 +0800

    Use v_pk_mul_f32 for short store alpha calculations (#1524)

commit 2dc6dacffc8ca8159485a18ab13f28d6ecc7d2b6
Author: hcman2 <[email protected]>
Date:   Thu Jan 9 11:02:16 2025 +0800

    Update TN F8B8BS equality yaml. (#1527)

commit cd917e0b7cf6944b241d6cb9545d22547d06107f
Author: Parth Kumar <[email protected]>
Date:   Wed Jan 8 19:09:06 2025 -0600

    Added kernels for HHS BBS (#1473)

commit 9b2ff15b1d7d03214f3894067902921faf11a57c
Author: Serge Lu <[email protected]>
Date:   Wed Jan 8 18:17:33 2025 +0800

    Added multiple devices support for matrix transform (#1338)

    * Added multiple devices support for matrix transform

    * Restore selected device for multi-device test of matrix transform

commit 62597c0a057f3c6c4d3dcdb4e415c2e306ef3109
Author: Josh Chang <[email protected]>
Date:   Wed Jan 8 14:02:55 2025 +0800

    gfx942_80cu BBS NN NT Tuning Release (#1526)

    1. Large K and large N cases

commit 14b2226b0281d9d6c72dfa15e8a89a1dcdf4c4d0
Author: KKyang <[email protected]>
Date:   Wed Jan 8 09:42:58 2025 +0800

    Use v_pk_mul_f32 for long store alpha multiplications (#1517)

commit d61fc8c333c636fcf3b77c52d1e59e49e274d350
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Tue Jan 7 12:29:52 2025 +0800

    Bump rocm-docs-core from 1.11.0 to 1.12.1 in /docs/sphinx (#1511)

    Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.11.0 to 1.12.1.
    - [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
    - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
    - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.11.0...v1.12.1)

    ---
    updated-dependencies:
    - dependency-name: rocm-docs-core
      dependency-type: direct:production
      update-type: version-update:semver-minor
    ...

    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit 78ec86229999aea66e86d70cf17ec1f6a11d0f87
Author: jichangjichang <[email protected]>
Date:   Tue Jan 7 11:52:34 2025 +0800

    update lib version as 0.13 (#1508)

commit e03b1e964d8f0774fe49d176096b418f93b577c8
Author: Jeffrey Novotny <[email protected]>
Date:   Mon Jan 6 22:20:22 2025 -0500

    Update license file for 2025 (#1519)

    * Update license file for 2025

    * Add current year as end date instead of changing the current year in copyright

commit d4c158fdf1eb990fb1f12f5c5172b72c6619c048
Author: hcman2 <[email protected]>
Date:   Tue Jan 7 08:29:31 2025 +0800

    Update TN F8BS Equality yaml. (#1518)

commit 422b5e06d472aea54d6d9238ed51670bdf5ef845
Author: David Dixon <[email protected]>
Date:   Mon Jan 6 09:42:17 2025 -0700

    Remove merge files options from cmakelists (#1513)

commit bafc3108fcc5d3793c115ea8d748a82e0168078d
Author: KKyang <[email protected]>
Date:   Mon Jan 6 15:53:49 2025 +0800

    Create README.md for TensileLite (#1466)

commit c1034284f4fd43b4adbc587d1adaabde68fa3989
Author: AndySu12 <[email protected]>
Date:   Mon Jan 6 15:28:54 2025 +0800

    Update gfx942 BBS/F8F8S NN/NT/TN Equality yamls for models (#1510)

commit 7ad3a47ed3898c60d438399130ca62e1e8ad4751
Author: KKyang <[email protected]>
Date:   Mon Jan 6 15:23:44 2025 +0800

    Update BBS NN/NT/TN equality tuning for gfx942_80cu (#1467)

commit 79d146710dd05045e6436df5408f43fa5227454a
Author: victorwu <[email protected]>
Date:   Thu Jan 2 05:51:31 2025 +0000

    TF32 NN NT TN

commit e4d2bc48af711ac31dcc8575b3e6534458b3dc28
Author: KKyang <[email protected]>
Date:   Mon Jan 6 10:17:04 2025 +0800

    Add BBS support for find_exact.py (#1465)

commit d6449b6f17868f0d5801651cc85e237c345f017b
Author: hcman2 <[email protected]>
Date:   Fri Jan 3 15:26:00 2025 +0800

    Update TN BBS Equality yaml. (#1505)

commit 0aa831f0bc756dddda44656aa1360e2136256c7b
Author: George Tseng <[email protected]>
Date:   Fri Jan 3 14:25:23 2025 +0800

    Disable non-supported features for DTVA. (#1501)

    * Disable non-supported features for DTVA.

    * Update for tail loop cases.

    * Update to UseSgprForGRO: [0].

commit 1b0b1c5a3c5582897dc3ef13f0dae5795c819aab
Author: David Dixon <[email protected]>
Date:   Thu Jan 2 17:07:52 2025 -0700

    Remove merge-files option (#1407)

    * Remove merge-files option
    * Update yaml files in tests
    * remove merge files from build commands

commit 95b1d7bfede4ee808b26249cf54012ba6ac7c915
Author: cm chen <[email protected]>
Date:   Wed Dec 18 04:42:15 2024 -0800

    gfx12 - enlarging tolerance is not needed

commit 964b4f3a17691d054518c21a974160cb8e192ca6
Author: hcman2 <[email protected]>
Date:   Thu Jan 2 09:59:51 2025 +0800

    Update NN/NT BBS Equality yamls. (#1498)

commit 582a9889af6513c6eb813b8df5e5983b01cea892
Author: jichang <[email protected]>
Date:   Mon Dec 30 10:03:55 2024 +0000

    update changelog

commit 5400c61b7ece89d83c964e08ce491c462d481f54
Author: jichang <[email protected]>
Date:   Mon Dec 30 07:00:40 2024 +0000

    Fix: deps folder is incorrect

commit 4d6dc03f410ee1f2d7ee32f0e2e6ad61d5e4a294
Author: Daine McNiven <[email protected]>
Date:   Mon Dec 30 09:01:07 2024 -0700

    Add reject states for failing streamk params (#1425)

    * Add reject states for failing streamk params.

    * Add ScheduleIterAlg == 1 as valid for streamk.

commit 3d5765fd0b24d215f8d762aeffdd6f855081e276
Author: AndySu12 <[email protected]>
Date:   Fri Dec 27 23:58:02 2024 +0800

    Update gfx942 HHS/BBS/F8BS/F8B8BS NN/NT/TN Equality/GridBased yamls for models (#1499)

commit a9667ff6dfaf5a571b9f7e4e589067db1111971d
Author: Daine McNiven <[email protected]>
Date:   Fri Dec 27 04:54:01 2024 -0700

    Support conjugate-transpose as equivalent to transpose (#1429)

    * Support conjugate-transpose

    * Convert conj-transpose to transpose during problem construction.

    * amendment to last commit

    * Remove unneeded use of arg.trans in aux tests.

commit cbb0c2ded50b6c0f0ae7ed8e217c653414a44fde
Author: Hao-Sheng Chen <[email protected]>
Date:   Fri Dec 27 15:20:29 2024 +0000

    Modify the tests to separate different hardware behavior

commit 0ebbb877d7ed792b99bd5ab32632ce52551966b2
Author: Hao-Sheng Chen <[email protected]>
Date:   Fri Dec 27 12:23:31 2024 +0000

    Fix incorrect verification method

commit 3bdace8bc0035d2104ed3bc7c9d7258f6465dc03
Author: Hao-Sheng Chen <[email protected]>
Date:   Tue Dec 24 11:56:10 2024 +0000

    Fix AMDClangVersion isn't assigned before running Tensile

commit 098f9e0c45fcc9e09e397be5111b90fb7778b331
Author: Hao-Sheng Chen <[email protected]>
Date:   Tue Dec 24 07:05:06 2024 +0000

    Fix the incorrect relative build path when it's not at the root level

commit de4c20447136f2b4940ca25224e0bf2fc68f89e9
Author: Hao-Sheng Chen <[email protected]>
Date:   Tue Dec 17 08:54:45 2024 +0000

    Fix garbage value of bias_type

commit 260c9ceffe3e77e9ea5413022a4c8aee29780049
Author: Serge Lu <[email protected]>
Date:   Fri Dec 27 18:24:57 2024 +0800

    Fixed random failures in MBSK post-loop (#1497)

    * Fixed missing srd increment for MBSK kernel

    * Re-enabled large sizes test cases for swizzle-A

    * Removed unnecessary changes

commit 4311f2d4af1bce6f857c8e0fbcf3bee7c807e0f2
Author: Jinp800125 <[email protected]>
Date:   Fri Dec 27 14:14:58 2024 +0800

    update 942 7 range F8HS GridBased (#1476)

    Co-authored-by: victorwu <[email protected]>

commit 4f1215afc895ceec63a8d584429f5efe0ac57f27
Author: hcman2 <[email protected]>
Date:   Fri Dec 27 04:46:42 2024 +0800

    Remove some Equality problems. (#1494)

    Remove the sizes for which GridBased is better.

commit d633749f107dbddd03f3a262adc484b6a5197e04
Author: briannwu <[email protected]>
Date:   Thu Dec 26 16:32:58 2024 +0800

    Using i32*u32->i64 when removing staggerU offset (#1487)

    Fix: memory access fault in tail loop with extremely large sizes

commit 2b8c9d8c604a530fd9316f8d4cb7af663aa0bed5
Author: tsenwang <[email protected]>
Date:   Wed Dec 25 08:08:01 2024 +0000

    cmake option

commit c883079a47aea8316cafaff4f142ad6d2250debe
Author: jichang <[email protected]>
Date:   Tue Dec 24 09:57:25 2024 +0000

    Fix: seg fault due to overflow in Uint32 Remainder calculation

commit 336aa6967910c8d0cca029986ee0781dbd0380e5
Author: hcman2 <[email protected]>
Date:   Thu Dec 26 00:29:31 2024 +0800

    Update TN input F8/F8B8/B8F8 and output F8/B8/B/H/S logic yaml. (#1489)

    1. modify the incorrect kernel name of
    aquavanjaram_Cijk_Alik_Bljk_F8B8BS_BH_BiasSB_HAS_SAB_SAV_UserArgs
    and
    aquavanjaram_Cijk_Alik_Bljk_F8BS_BH_BiasSB_HAS_SAB_SAV_UserArgs
    2. Update and add other yamls.

commit 26eba0f556c7cf4c7ecb976e4cd31b25f592340d
Author: Josh Chang <[email protected]>
Date:   Wed Dec 25 17:13:28 2024 +0800

    Fix returnAlgoCount typo (#1492)

commit 1893cdf267ec77bdcb12c149963bf34e63d85ff6
Author: Jinp800125 <[email protected]>
Date:   Wed Dec 25 09:45:05 2024 +0800

    Update F8B8BS and F8BS TN BBS NN, NT, TN GridBased (#1486)

    Co-authored-by: victorwu <[email protected]>

commit 5997a6c7f72b575778b45d600ab07eb17628e6f1
Author: hcman2 <[email protected]>
Date:   Tue Dec 24 20:22:36 2024 +0800

    Update F8B8BS and F8BS TN logic yamls. (#1482)

    Update 2 logic yamls from row-wise logic yaml.

commit 473d638e2b568774aeb7609073828a3af0f66205
Author: jichang <[email protected]>
Date:   Mon Dec 23 10:37:29 2024 +0000

    Fix: seg fault when executing some solutions with extremely large sizes

commit 3ef4fb26ca098ec0d5b4df68b7151fd71b59c229
Author: hcman2 <[email protected]>
Date:   Tue Dec 24 10:33:06 2024 +0800

    Update F8F8S logic yaml and enable GSU+F8 (#1479)

    1. Remove the rejection of GSU+F8 output.
    2. Update aquavanjaram_Cijk_Alik_Bljk_F8F8S_BH_BiasSH_HAS_SAB_SCD_SAV_UserArgs.yaml

commit 61c3ba5ab7f8b5aede8bc678548db05e656c1181
Author: mahmoodw <[email protected]>
Date:   Mon Dec 23 10:39:25 2024 -0700

    Changes to exclude StreamK by default (#1471)

    Co-authored-by: mahmoodw <[email protected]>

commit 09ba034b3148a25f4b876e1a2f9192ab585530f8
Author: hcman2 <[email protected]>
Date:   Mon Dec 23 10:04:20 2024 +0800

    Use archVGPR when accVGPR is not enough. (#1460)

    This PR is to support larger MT such as 256x320.
    Generally, we only have 256 accVGPRs.
    If MT is larger than 256x256, we need some extra archVGPR to store the acc
    results.

commit 0c8494a3d569a996330c15fac4dde9ff8ce8691a
Author: AndySu12 <[email protected]>
Date:   Mon Dec 23 03:10:20 2024 +0800

    Update gfx942 F8F8S TN Equality/GridBased yamls for torch._scaled_mm fix (#1477)

commit 0579107e30a2ef5b7970e288a20f21e81ac559fe
Author: smalekta <[email protected]>
Date:   Fri Dec 20 15:29:30 2024 -0500

    Gfx942 80cu grid based and equality tuning for HHS NN (#1474)

    * adding point to fix the grid before tuning

    * added point to HHS_NN grid for 80cu to make the grid denser

    * adding a size equality HHS_NN for 80cu

commit f5fd734754526c0e5398bf6a92cfb26c1ab103bf
Author: Feroz <[email protected]>
Date:   Fri Dec 20 08:54:39 2024 -0600

    Aquavanjaram 20CU equality GEMM tuning updates for TF32 NN and TN data type (#1470)

    * Aquavanjaram 20CU equality NN TN TF32 GEMM sizes tuned

commit 5ca877f6aeccdba1aba99235fef2bb1be62a6768
Author: Ethan <[email protected]>
Date:   Fri Dec 20 12:41:33 2024 +0800

    code-gen: improved tail loop and edge tile of swizzled A

    * Opt swizzleA tail-loop and minor bug fix

    * Add test cases for tail loop

    * disable big sizes

commit 6b413e322e57660a7c6803d59a321bd96530521c
Author: Hao-Sheng Chen <[email protected]>
Date:   Fri Dec 20 10:30:10 2024 +0800

    Fix incorrect type casting for alpha and beta in f16 compute type (#1462)

    * Fix incorrect type casting for alpha and beta in f16 compute type

commit 73a04a0b69ea27e87542317d9fff556a140c2045
Author: mengzcai <[email protected]>
Date:   Tue Dec 10 17:59:50 2024 +0000

    Support UseSgprForGRO for dtva

commit 322d6b10308ba88b6552847a9cd5759bf049b168
Author: George Tseng <[email protected]>
Date:   Thu Dec 19 11:47:18 2024 +0800

    Update gfx942 NT/TN/NN FP8/BF8/BF16 Equality (#1463)

    * update 48 Equality logic yaml sizes.

    * update bbs logic yamls.

commit bffef5e46d9acc80db5daa6d0c4579dfbd772b85
Author: Josh Chang <[email protected]>
Date:   Wed Dec 18 16:20:22 2024 +0800

    gfx942_80cu BBS NN NT TN Tuning Release (#1459)

    1. Large K and large N cases

commit 856fec23bb11d12e382273badf67fbe1bf6cfda3
Author: Hao-Sheng Chen <[email protected]>
Date:   Wed Dec 18 15:59:04 2024 +0800

    Revert "Fix incorrect type casting for alpha and beta in f16 compute type"

    This reverts commit 3600099416461e108e5e14c6e49e30acfb66c9e9.

commit aa3a2ec5e572cb7c0868640fcb4ca263a8340499
Author: NaveenElumalaiAMD <[email protected]>
Date:   Tue Dec 17 10:52:14 2024 -0700

    Fix invalid string printed when running hipblaslt-test (#1428)

    * Fix invalid string printed when running hipblaslt-test

    * return -1 for bad values

commit 3600099416461e108e5e14c6e49e30acfb66c9e9
Author: Hao-Sheng Chen <[email protected]>
Date:   Sun Dec 15 20:33:57 2024 +0000

    Fix incorrect type casting for alpha and beta in f16 compute type

commit 4d40e360f2a6b82888451e2669a456ba1717932a
Author: hcman2 <[email protected]>
Date:   Tue Dec 17 10:39:21 2024 +0800

    Update solutions for hstu bmm 3 sizes. (#1453)

commit b61a09590ee8f437055086f29156c704a68f6385
Author: KKyang <[email protected]>
Date:   Tue Dec 17 10:38:48 2024 +0800

    Update gfx942_80cu NT/TN/NN f16/f32 Equality (#1452)

commit 83930f22138831b276ffaa8f0d39075fe425a268
Author: Mark Kim <[email protected]>
Date:   Mon Dec 16 12:54:23 2024 -0600

    gfx942 38cu HSS/BSS NN TN NT grid tune (#1448)

commit b089e406cdcc469de1fa829ec6926ea530f32791
Author: Mark Kim <[email protected]>
Date:   Mon Dec 16 09:04:18 2024 -0600

    gfx942 38cu SGEMM NN TN NT grid tune (#1439)

commit ba818b0e7faffa76587bd8ab0ad175c44fb0d68f
Author: Mark Kim <[email protected]>
Date:   Mon Dec 16 09:04:02 2024 -0600

    gfx942 38cu F8HS NN TN NT grid tune (#1440)

commit 71295a0e9f888938be6f9c8939ac80a24752b23c
Author: Vin Huang <[email protected]>
Date:   Sat Dec 14 13:34:32 2024 +0000

    Refactor Sparse Test in TensileLite

commit 41767d9554163403ff9c54d5f07aa2b0d14d110a
Author: KKyang <[email protected]>
Date:   Mon Dec 16 13:46:04 2024 +0800

    [Hotfix] correct occupancy calculation (#1451)

commit 0ed9795b44579c98218bcdf20cf3f41aa5e622b9
Author: David Dixon <[email protected]>
Date:   Sat Dec 14 16:06:17 2024 +0000

    glob for yaml and co in client writer

commit b263c6bf45a99cdd374b77798114f6f08413d5c0
Author: David Dixon <[email protected]>
Date:   Fri Dec 13 17:31:55 2024 +0000

    remove warning

commit b79b0320b56b8924093811386287cfee9d5d6a74
Author: David Dixon <[email protected]>
Date:   Fri Dec 13 17:16:19 2024 +0000

    newline at eof

commit 0e5b7bf679b770c775a479c5db1759c2be70f101
Author: David Dixon <[email protected]>
Date:   Fri Dec 13 17:14:29 2024 +0000

    remove := usage for py36 and add LibraryClient test

commit 6529544fd206a665c7758e19e578847ab9b83398
Author: David Dixon <[email protected]>
Date:   Fri Dec 13 13:36:39 2024 +0000

    Patch client writer

commit 4654ea4e160746882bbd915ce189f4786db7a916
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Fri Dec 13 17:29:49 2024 -0700

    Bump rocm-docs-core from 1.10.0 to 1.11.0 in /docs/sphinx (#1427)

    Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.10.0 to 1.11.0.
    - [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
    - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
    - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.10.0...v1.11.0)

    ---
    updated-dependencies:
    - dependency-name: rocm-docs-core
      dependency-type: direct:production
      update-type: version-update:semver-minor
    ...

    Signed-off-by: dependabot[bot] <[email protected]>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit 545cc7ecbfcfebea013026a2a2e96b022f11fae5
Author: David Dixon <[email protected]>
Date:   Fri Dec 13 11:52:56 2024 -0700

    Remove functions for generating file paths (#1402)

    * Remove functions for building paths to files and validation

commit d1c7b59227cdf56b2bdf472d8b90f1415fe15782
Author: AndySu12 <[email protected]>
Date:   Fri Dec 13 19:26:27 2024 +0800

    Update gfx942 BBS NN/NT/TN Equality/GridBased yamls for 1204 bbs bmm dynamic (#1446)

commit 88fef9953dba5e0fbce8c106ab69f30482467f90
Author: KKyang <[email protected]>
Date:   Fri Dec 13 10:28:38 2024 +0800

    Enable setOccupancyLimit for MBSK (#1438)

    * Refactor MBSK related functions

    * Fix vgpr occupancy not calculated correctly in unified mode

    * Support setOccupancyLimit for MBSK

commit e453aebb2899e11534e6d0c45a7114e755838b83
Author: Braden Stefanuk <[email protected]>
Date:   Thu Dec 12 08:51:42 2024 -0700

    Verify toolchain components at program invocation (#1413)

    * feat: toolchain components not globally accessible
    * feat: update version acquisition
    * chore: remove extraneous compiler verifications

commit 8824d1acc92dc5bc3558638ade5dcb27ced8060b
Author: Parth Kumar <[email protected]>
Date:   Thu Dec 12 08:54:31 2024 -0600

    Tune Aquavanjaram942X F8F8 and F8BF8 TN for equality (#1430)

    * added F8F8 F8BF8 kernels

    * removed underperforming sizes

    * review comment fix

commit e13e13351a8167fefec51974786f3822a6e38281
Author: hcman2 <[email protected]>
Date:   Thu Dec 12 10:36:01 2024 +0800

    LSU supports larger MT and reuse LDS. (#1433)

    1. Move LSU into LSU.py.
    2. Do partial LSU when the LDS is not enough.

commit ebd940f8e0351ae212280f2eb60f7d2717d4f72b
Author: David Dixon <[email protected]>
Date:   Wed Dec 11 11:02:44 2024 -0700

    Remove manifest functionality (#1401)

    * Remove manifest functionality
    * Remove manifest usage from cmake and client writer

commit 573675edba6e5f0405f3033667194dcf731d2e80
Author: KKyang <[email protected]>
Date:   Wed Dec 11 14:03:07 2024 +0800

    Support gridbased kdtree search for batched gemm (#1417)

commit 82d40441735326edc918c066191eabb8ea03c82e
Author: Feroz <[email protected]>
Date:   Tue Dec 10 21:35:14 2024 -0600

    Tune aquavanjaram942 20CU equality TF32 NN GEMM (#1432)

commit c92591ad11eb5c2e38f7a92f43626e13fbe714f6
Author: Henry Ho <[email protected]>
Date:   Thu Dec 5 09:24:03 2024 +0800

    remove code owner

commit ec597a6b5132c3573cbcfc8dd29a04620699a991
Author: Henry Ho <[email protected]>
Date:   Tue Nov 5 09:58:58 2024 +0800

    extend MBSK WS to 40MB

commit a62fb4e879da0028b5b6d51166dc8b084b0e5343
Author: Braden Stefanuk <[email protected]>
Date:   Tue Dec 10 09:12:08 2024 -0700

    Remove unsupported assignment expression operator (#1434)

commit 6190ac0ce61e45fbbc240b2fb38c31f3d1049144
Author: David Dixon <[email protected]>
Date:   Tue Dec 10 09:00:31 2024 -0700

    Add alternative tool for incremental builds when tuning (#1431)

    * Add alternative tool for incremental builds when tuning
    * Ensure rebuild occurs when files are removed
    * Compute ARCH from co file and use ROCM_PATH
    * Simplify wavefrontsize

commit dd826190a36b6a2cad5504fe774cdf8160835234
Author: David Dixon <[email protected]>
Date:   Tue Dec 10 05:26:04 2024 -0700

    Reduce/clean up TCL output (#1426)

    * Reduce/clean up TCL output

commit 79fdf7e2304729ed526940acaa373a315e100cde
Author: David Dixon <[email protected]>
Date:   Mon Dec 9 06:30:03 2024 -0700

    Remove unused write functions (#1400)

commit 20590e11ab919357eab329fae27d9729617f4f63
Author: David Dixon <[email protected]>
Date:   Mon Dec 9 06:29:27 2024 -0700

    Remove PackageLibrary option (#1367)

    * Remove PackageLibrary option

commit 1ef6b6dafb00cea5f7edf77f6fdac53ce56a5fe5
Author: David Dixon <[email protected]>
Date:   Mon Dec 9 06:28:48 2024 -0700

    Removes client config option (#1404)

    * Removes client config option
    * Remove unused global

commit 7e7fcfee815e9f234b7f1ab8665b390755c25657
Author: wencchen <[email protected]>
Date:   Mon Dec 2 07:33:48 2024 +0000

    add f32 equality logic yamls

commit 4b13967503b2658afcc0d84f1f7e6da23e923736
Author: KKyang <[email protected]>
Date:   Mon Dec 9 13:59:18 2024 +0800

    Fix cpuThreads == 0 not working properly (#1337)

commit a457413b9a10c42f205408997547c9a7480bd1d3
Author: briannwu <[email protected]>
Date:   Mon Dec 9 11:34:14 2024 +0800

    [OPT] Optimize tail loop (#1353)

    Use buffer_load_dword instructions instead of buffer_load_d16
    If tailLoopOpt is False, the original method (GuardK) will be used.
    Not supported: WaveSeparateGlobalRead1/2, DTV, LocalSplitU, SparseGemm

commit c1f9582f7ca4aa5e60ff7ac91710eed1e0fdb4fc
Author: Braden Stefanuk <[email protected]>
Date:   Fri Dec 6 15:20:12 2024 -0700

    Code object compression via bundling (#1374)

    * feat: compress code objects
    * feat: add --no-compress flag

commit d43d84a8ce849de9494f70221884bfb3a8fd7b9e
Author: David Dixon <[email protected]>
Date:   Fri Dec 6 06:33:54 2024 -0700

    Remove client/benchmark write functions (#1398)

commit 8f687dba4b8e4d320f3a9507563364c12991a0cd
Author: David Dixon <[email protected]>
Date:   Fri Dec 6 06:33:20 2024 -0700

    Remove embed library option (#1403)

commit a20a7b674bdf7c9d23f176b25507a46b13abbfa9
Author: wencchen <[email protected]>
Date:   Tue Nov 19 18:22:34 2024 +0800

    f32 mac bug fix

commit 71fc08c5862868ba10a2c2b30415b50860218476
Author: Hao-Sheng Chen <[email protected]>
Date:   Tue Dec 3 07:26:36 2024 +0000

    Add f8, i8 and f16 logic yaml on gfx12

commit 7bf065b2a33b7e108fbe7296c8ebdb6c4e292d13
Author: Feroz <[email protected]>
Date:   Thu Dec 5 17:52:24 2024 -0600

    Aquavanjaram942 20CU Tune HHS NN and TN GEMM sizes for equality and grid lib (#1390)

    Co-authored-by: Feroz <[email protected]>

commit d16c2049ecf4ed9ede0173c463ce9b15b056a69f
Author: David Dixon <[email protected]>
Date:   Thu Dec 5 08:25:50 2024 -0700

    Remove write cmake function (#1397)

commit 5fb2a4f9168aaab184396f08a3d74da36162ce84
Author: Alex Brown <[email protected]>
Date:   Thu Dec 5 07:39:14 2024 -0700

    Add initial optional stream-k libraries (#1347)

commit 3f8ae669ae53bc8fb674275128395f84787a850f
Author: Alex Brown <[email protected]>
Date:   Thu Dec 5 07:38:41 2024 -0700

    Fix stream-k kernels when K==0, allow beta*C calculation (#1399)

commit 8ddfcabd8dc8b720ec9a4f3f9c3c57ada85d311d
Author: Alex Brown <[email protected]>
Date:   Wed Dec 4 09:07:46 2024 -0700

    Stream-k batch predict (#1410)

    * Add batch to grid predictor, and modify calculation of fixup peers
    * Disable fixup peers change for now, requires coefficient changes

commit 663de3e1fbcc9d80f8db92e69eaf3090069d7a6e
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Mon Dec 2 11:45:33 2024 +0000

    Bump rocm-docs-core from 1.9.2 to 1.10.0 in /docs/sphinx

    Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.9.2 to 1.10.0.
    - [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
    - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
    - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.9.2...v1.10.0)

    ---
    updated-dependencies:
    - dependency-name: rocm-docs-core
      dependency-type: direct:production
      update-type: version-update:semver-minor
    ...

    Signed-off-by: dependabot[bot] <[email protected]>

commit b25be168c0dc1013e5d59c79fb964f338b215e3c
Author: jichang <[email protected]>
Date:   Mon Dec 2 05:48:59 2024 +0000

    tensilelite: asm-new.sh can be executed from a different folder

commit 52c9825671cec110bb2de3bff3c73c3c14df3aec
Author: Feroz <[email protected]>
Date:   Tue Dec 3 10:12:47 2024 -0600

    Tune Aldebaran BF16 NN TN NT GEMM sizes (#1323)

    * Tune Aldebaran BF16 NN TN NT GEMM sizes

commit d7dc5cc4d6b741e4770c6f3bf44fe2ecc52820ac
Author: Serge Lu <[email protected]>
Date:   Tue Dec 3 07:16:23 2024 +0000

    Temporarily remove some test cases for swizzle-A

commit 233894ebc00be0be784e3591e05df36e94fa1d57
Author: who who who <[email protected]>
Date:   Tue Dec 3 15:14:34 2024 +0800

    Fp8 tuning upstream (#1380)

    * add fp8 tuning config

    * fix bugs

    * fix arch name for mi308 and mi210

    * add scale and bias

    * add full mi

    * fix bugs

    * fix bugs

    * add activation pattern

    * fix a bug

    * fix merge bug when inData has no solution

    * limit wave size to less than or equal to 4

    * update readme

    * fix full stage

    * update pattern to match latest hipblaslt log

    * fix activation

    * update log example

    * apply full stage in gridbase tuning

    * fix duplicate problem size introduced by solution index

    * pass unsupported gemm config

    * calculate gsu

    * fix a bug

    * fix a bug

    * skip slow kernels

commit f52a366cd4dd80ea2ea32706b954b231f987c836
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.gith…